Structured Probabilistic Models for Natural Language Semantics
Abstract
The recent past has witnessed a predominance of robust empirical methods in natural language structure prediction, but mostly through the analysis of syntax. Wide-coverage analysis of the underlying meaning, or semantics, of natural language utterances remains a major obstacle for language understanding. A primary bottleneck is the scarcity of large amounts of high-quality annotated data that provide complete information about the semantic structure of natural language expressions. In this thesis proposal, we study structured probabilistic models tailored to problems in computational semantics, with a focus on modeling structure that is not visible in annotated text data.

First, we investigate paraphrase identification, the task of recognizing whether two sentences convey the same meaning. Our approach systematically blends natural language syntax and lexical semantics within a probabilistic framework, and achieves state-of-the-art accuracy on a standard corpus when trained on a set of true and false sentential paraphrases (Das and Smith, 2009). Given a pair of sentences, the model recognizes the paraphrase relationship by predicting the structure of one sentence given the other, allowing loose syntactic transformations and lexical semantic alterations at the level of aligned words.

Second, we focus on frame-semantic parsing. Frame semantics offers deep linguistic analysis that exploits lexical semantic properties and relationships among semantic frames and roles. We describe probabilistic models for analyzing a sentence to produce a full frame-semantic parse. Our models leverage the FrameNet (Fillmore et al., 2003) and WordNet (Fellbaum, 1998) lexica, a small corpus containing full-text annotations of natural language sentences, and syntactic representations from which we derive features, and achieve significant improvements over previously published results on the same corpus (Das et al., 2010a).

Unfortunately, the datasets used to train our paraphrase and frame-semantic models are too small to yield robust performance. To obviate this problem, a common trait of our methods is the hypothesis of hidden structure in the data. To this end, we employ conditional log-linear models over structures that, first, can incorporate a wide variety of features gathered from the data as well as from various lexica, and second, use latent variables to model information missing from the annotated data. For the paraphrase problem, our model assumes hidden alignments between the syntactic structures of the sentence pair; while unobserved during training and testing, these alignments produce a meaningful correspondence between the two sentences' syntax at inference time, as a by-product. For frame-semantic parsing, we face the challenge of identifying semantic frames for previously unseen lexical items. To generalize the model to such items, we adopt a stochastic process in which a given semantic frame generates a latent lexical item, and that lexical item in turn generates the unseen item through a lexical semantic transformation process. Continuing with the theme of hypothesizing hidden structure in data for modeling natural language semantics, we propose to leverage large volumes of unlabeled data to improve upon the aforementioned tasks.
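To make the latent-variable formulation above concrete, one standard way to write such a conditional log-linear model (our notation; the specific models of Das and Smith (2009) and Das et al. (2010a) differ in their details) is

p_\theta(y \mid x) \;=\; \sum_{z} p_\theta(y, z \mid x) \;=\; \frac{\sum_{z} \exp\{\theta^{\top} f(x, y, z)\}}{\sum_{y'} \sum_{z'} \exp\{\theta^{\top} f(x, y', z')\}},

where x is the observed input (e.g., a sentence pair), y is the predicted output (e.g., a paraphrase judgment or a frame-semantic structure), z ranges over the hidden structure (e.g., a syntactic alignment or a latent lexical item), and f is a feature vector function. Training maximizes the conditional log-likelihood \sum_i \log p_\theta(y_i \mid x_i), marginalizing out z; at inference time the maximizing z can be read off as a by-product, which is how a model can recover alignments it was never shown.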
As an attempt to harvest raw data to boost semantic structure prediction, we intend to gather categorical and topical word clusters from a large corpus using standard clustering techniques that examine the lexical and syntactic contexts around a given word. Inspired by recent advances in syntactic parsing (Koo et al., 2008) and information extraction (Miller et al., 2004; Lin and Wu, 2009), we propose to incorporate features based on these clusters into our existing models for paraphrase identification and frame-semantic parsing, with the aim of mitigating data sparsity.
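As a minimal sketch of how such cluster features might enter a model's feature function, the Python fragment below assumes a Brown-style hard clustering with bit-string cluster IDs, in the spirit of Koo et al. (2008); the cluster values, feature templates, and names are hypothetical illustrations, not the proposal's actual implementation.

# Hypothetical hard clustering: word -> Brown-style bit-string cluster ID,
# as might be learned from a large unlabeled corpus.
CLUSTERS = {
    "acquire": "110100",
    "purchase": "110100",  # same cluster as "acquire"
    "buy": "110101",
}

def cluster_features(word, role):
    """Emit cluster-membership features at two prefix lengths, so that
    rare words share features with frequent words in the same cluster."""
    feats = {}
    bits = CLUSTERS.get(word.lower())
    if bits is None:
        feats[f"role={role}|cluster=OOV"] = 1.0
        return feats
    for k in (4, 6):  # coarse and fine cluster granularities
        feats[f"role={role}|cluster[:{k}]={bits[:k]}"] = 1.0
    return feats

# Example: "purchase" and "acquire" fire identical features for the same
# role, even if only one of them appears in the annotated training data.
print(cluster_features("purchase", "Buyer"))

The point of the prefix lengths is that coarser prefixes group more words together, trading specificity for coverage, which is precisely the data-sparsity remedy the proposal targets.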
Similar resources
Probabilistic Relational Models
Probabilistic relational models (PRMs) are a rich representation language for structured statistical models. They combine a frame-based logical representation with probabilistic semantics based on directed graphical models (Bayesian networks). This chapter gives an introduction to probabilistic relational models, describing semantics for attribute uncertainty, structural uncertainty, and class ...
Semantics and Inference for Recursive Probability Models
In recent years, there have been several proposals that extend the expressive power of Bayesian networks with that of relational models. These languages open the possibility for the specification of recursive probability models, where a variable might depend on a potentially infinite (but finitely describable) set of variables. These models are very natural in a variety of applications, e.g., i...
Efficiency in Ambiguity: Two Models of Probabilistic Semantics for Natural Language
This paper explores theoretical issues in constructing an adequate probabilistic semantics for natural language. Two approaches are contrasted. The first extends Montague Semantics with a probability distribution over models. It has nice theoretical properties, but does not account for the ubiquitous nature of ambiguity; moreover inference is NP-hard. An alternative approach is described in whi...
Structural Parsing of Natural Language Text in Tamil Using Phrase Structure Hybrid Language Model
Parsing is important in Linguistics and Natural Language Processing to understand the syntax and semantics of a natural language grammar. Parsing natural language text is challenging because of the problems like ambiguity and inefficiency. Also the interpretation of natural language text depends on context based techniques. A probabilistic component is essential to resolve ambiguity in both syn...
Publication date: 2010